2  Data frames

Author

Vladimir Buskin

2.1 Data frames in Base R

  • Data frames are comparable to spreadsheets. They can be built by combining multiple vectors and applying the data.frame() function to them.
# Define the columns
participant <-  c("louis", "paula", "vincenzo")
score <- c(67, 85, 32)

# Combine the columns into a data frame
mydf <- data.frame(participant, score)
mydf
output
  participant score
1       louis    67
2       paula    85
3    vincenzo    32
  • You can check the structure of the data frame with str().
str(mydf)
output
'data.frame':   3 obs. of  2 variables:
 $ participant: chr  "louis" "paula" "vincenzo"
 $ score      : num  67 85 32
  • The individual columns of a data frame can be accessed using the $ symbol. The output is a vector:
mydf$participant
output
[1] "louis"    "paula"    "vincenzo"
  • Just like with vectors, it is possible to access specific elements of a data frame:
# First row
mydf[1,]
output
  participant score
1       louis    67
# Second column
mydf[,2]
output
[1] 67 85 32
# First two rows
mydf[1:2,]
output
  participant score
1       louis    67
2       paula    85
# First column, second entry
mydf[,1][2]
output
[1] "paula"
  • Extract only those cells where a certain property applies:
# Extract the row for the participant Vincenzo
mydf[mydf$participant == 'vincenzo',]
output
  participant score
3    vincenzo    32
# Further subset this data frame to obtain Vincenzo's score only
mydf[mydf$participant == 'vincenzo',]$score
output
[1] 32

2.2 Basics of R III: Data frames in the tidyverse

  • Load the tidyverse library to gain access to a large set of data manipulation and visualisation tools. If you have not installed it yet, uncomment the first line of the following code and run it.
#install.packages("tidyverse")
library("tidyverse")
  • Repeat the same procedure for the packages readxl and writexl to be able to read from and write to Microsoft Excel (.xlsx) files.
#install.packages("readxl")
library("readxl")
#install.packages("writexl")
library("writexl")
  • Download the large spreadsheet file NYC_flights_data.xlsx from ILIAS and place it in your data folder. Now import the dataset (cf. instructions in the script file) and store it in a variable flights.

2.2.1 Rows

2.2.1.1 filter()

  • Show flights where the departure was delayed by more than 120 minutes:
flights %>% # pass the "flights" data frame on to the next function
  filter(dep_delay > 120) # apply filter
output
# A tibble: 9,723 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      848           1835       853     1001           1950
 2  2013     1     1      957            733       144     1056            853
 3  2013     1     1     1114            900       134     1447           1222
 4  2013     1     1     1540           1338       122     2020           1825
 5  2013     1     1     1815           1325       290     2120           1542
 6  2013     1     1     1842           1422       260     1958           1535
 7  2013     1     1     1856           1645       131     2212           2005
 8  2013     1     1     1934           1725       129     2126           1855
 9  2013     1     1     1938           1703       155     2109           1823
10  2013     1     1     1942           1705       157     2124           1830
# ℹ 9,713 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
  • Show flights that departed on 01 January:
flights %>% 
  filter(month == 1 & day == 1) # "&" operator means 'and'
output
# A tibble: 842 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 832 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
Coding style

The above examples can also be rewritten in the form function(data frame, argument). This is only a viable alternative as long as no further functions are added (compared to, e.g., the example in 2.3).

filter(flights, dep_delay > 120)

filter(flights, month == 1 & day == 1)

2.2.1.2 arrange()

  • Rearrange rows based on column values:
flights %>% 
  arrange(dep_delay)
output
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    12     7     2040           2123       -43       40           2352
 2  2013     2     3     2022           2055       -33     2240           2338
 3  2013    11    10     1408           1440       -32     1549           1559
 4  2013     1    11     1900           1930       -30     2233           2243
 5  2013     1    29     1703           1730       -27     1947           1957
 6  2013     8     9      729            755       -26     1002            955
 7  2013    10    23     1907           1932       -25     2143           2143
 8  2013     3    30     2030           2055       -25     2213           2250
 9  2013     3     2     1431           1455       -24     1601           1631
10  2013     5     5      934            958       -24     1225           1309
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
  • Arrange the column values in descending order:
flights %>%  
  arrange(desc(dep_delay))
output
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     9      641            900      1301     1242           1530
 2  2013     6    15     1432           1935      1137     1607           2120
 3  2013     1    10     1121           1635      1126     1239           1810
 4  2013     9    20     1139           1845      1014     1457           2210
 5  2013     7    22      845           1600      1005     1044           1815
 6  2013     4    10     1100           1900       960     1342           2211
 7  2013     3    17     2321            810       911      135           1020
 8  2013     6    27      959           1900       899     1236           2226
 9  2013     7    22     2257            759       898      121           1026
10  2013    12     5      756           1700       896     1058           2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

2.2.2 Columns

2.2.2.1 mutate()

  • Add new columns to the data frame based on existing ones:
flights %>%  
  mutate(
    gain = dep_delay - arr_delay, # add gain column
    speed = distance / air_time * 60, # add speed column
    .before = 1 # add new columns to the left of the df
  )
output
# A tibble: 336,776 × 21
    gain speed  year month   day dep_time sched_dep_time dep_delay arr_time
   <dbl> <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1    -9  370.  2013     1     1      517            515         2      830
 2   -16  374.  2013     1     1      533            529         4      850
 3   -31  408.  2013     1     1      542            540         2      923
 4    17  517.  2013     1     1      544            545        -1     1004
 5    19  394.  2013     1     1      554            600        -6      812
 6   -16  288.  2013     1     1      554            558        -4      740
 7   -24  404.  2013     1     1      555            600        -5      913
 8    11  259.  2013     1     1      557            600        -3      709
 9     5  405.  2013     1     1      557            600        -3      838
10   -10  319.  2013     1     1      558            600        -2      753
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2.2.2.2 select()

  • Choose the variables to be used for further analysis and, conversely, exclude all others:
flights %>% 
  select(origin, dest)
output
# A tibble: 336,776 × 2
   origin dest 
   <chr>  <chr>
 1 EWR    IAH  
 2 LGA    IAH  
 3 JFK    MIA  
 4 JFK    BQN  
 5 LGA    ATL  
 6 EWR    ORD  
 7 EWR    FLL  
 8 LGA    IAD  
 9 JFK    MCO  
10 LGA    ORD  
# ℹ 336,766 more rows

2.2.2.3 rename()

  • Change the name of an existing column with rename(). For example, replace the old column names of dep_time, sched_dep_time and dep_delay with dep.time, sched.dep.time and dep.delay, respectively.
flights %>%
  rename(dep.time = dep_time,
         sched.dep.time = sched_dep_time,
         dep.delay = dep_delay
         ) # new name = old name
output
# A tibble: 336,776 × 19
    year month   day dep.time sched.dep.time dep.delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

2.2.2.4 relocate()

  • The column arguments of relocate() are moved to different positions in the tibble. Without additional specification, the function moves them to the beginning:
flights %>% 
  relocate(day, month, year)
output
# A tibble: 336,776 × 19
     day month  year dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1     1     1  2013      517            515         2      830            819
 2     1     1  2013      533            529         4      850            830
 3     1     1  2013      542            540         2      923            850
 4     1     1  2013      544            545        -1     1004           1022
 5     1     1  2013      554            600        -6      812            837
 6     1     1  2013      554            558        -4      740            728
 7     1     1  2013      555            600        -5      913            854
 8     1     1  2013      557            600        -3      709            723
 9     1     1  2013      557            600        -3      838            846
10     1     1  2013      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
  • The optional arguments .before() and .after() provide explicit instructions for column positioning. The default values are NULL.
flights %>% 
  relocate(year:dep_time, .after = sched_dep_time)
output
# A tibble: 336,776 × 19
   sched_dep_time  year month   day dep_time dep_delay arr_time sched_arr_time
            <int> <int> <int> <int>    <int>     <dbl>    <int>          <int>
 1            515  2013     1     1      517         2      830            819
 2            529  2013     1     1      533         4      850            830
 3            540  2013     1     1      542         2      923            850
 4            545  2013     1     1      544        -1     1004           1022
 5            600  2013     1     1      554        -6      812            837
 6            558  2013     1     1      554        -4      740            728
 7            600  2013     1     1      555        -5      913            854
 8            600  2013     1     1      557        -3      709            723
 9            600  2013     1     1      557        -3      838            846
10            600  2013     1     1      558        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% 
  relocate(starts_with("arr"), .before = sched_dep_time)
output
# A tibble: 336,776 × 19
    year month   day dep_time arr_time arr_delay sched_dep_time dep_delay
   <int> <int> <int>    <int>    <int>     <dbl>          <int>     <dbl>
 1  2013     1     1      517      830        11            515         2
 2  2013     1     1      533      850        20            529         4
 3  2013     1     1      542      923        33            540         2
 4  2013     1     1      544     1004       -18            545        -1
 5  2013     1     1      554      812       -25            600        -6
 6  2013     1     1      554      740        12            558        -4
 7  2013     1     1      555      913        19            600        -5
 8  2013     1     1      557      709       -14            600        -3
 9  2013     1     1      557      838        -8            600        -3
10  2013     1     1      558      753         8            600        -2
# ℹ 336,766 more rows
# ℹ 11 more variables: sched_arr_time <int>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

2.2.3 The pipe

  • Depending on the situation, it may be necessary to combine various modes of data manipulation.
flights %>% # data frame
  filter(dest == "IAH") %>%   # only flights to IAH
    mutate(speed = distance / air_time * 60) %>%   # create column with speed
      select(year:day, dep_time, carrier, flight, speed) %>%   # choose variables
        arrange(desc(speed)) # sort values in descending order
output
# A tibble: 7,198 × 7
    year month   day dep_time carrier flight speed
   <int> <int> <int>    <int> <chr>    <int> <dbl>
 1  2013     7     9      707 UA         226  522.
 2  2013     8    27     1850 UA        1128  521.
 3  2013     8    28      902 UA        1711  519.
 4  2013     8    28     2122 UA        1022  519.
 5  2013     6    11     1628 UA        1178  515.
 6  2013     8    27     1017 UA         333  515.
 7  2013     8    27     1205 UA        1421  515.
 8  2013     8    27     1758 UA         302  515.
 9  2013     9    27      521 UA         252  515.
10  2013     8    28      625 UA         559  515.
# ℹ 7,188 more rows

2.2.4 Groups

2.2.4.1 group_by() and summarize()

  • group_by() creates a grouped data frame that is organised by a variable of your choice (e.g., “month”).
flights %>%  
  group_by(month)
output
# A tibble: 336,776 × 19
# Groups:   month [12]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
  • Use summarize() to create a new data frame with rows for each combination of grouping variables.
flights %>%  
  group_by(month) %>% 
  summarize(
    delay = mean(dep_delay, na.rm = TRUE), # second argument removes missing values
    n = n() # shows number of rows in this group
  )
output
# A tibble: 12 × 3
   month delay     n
   <int> <dbl> <int>
 1     1 10.0  27004
 2     2 10.8  24951
 3     3 13.2  28834
 4     4 13.9  28330
 5     5 13.0  28796
 6     6 20.8  28243
 7     7 21.7  29425
 8     8 12.6  29327
 9     9  6.72 27574
10    10  6.24 28889
11    11  5.44 27268
12    12 16.6  28135